ABSTRACT


Skill prerequisite information is useful for tutoring systems that assess
student knowledge or that provide remediation. These systems
often encode prerequisites as graphs designed by subject matter 
experts in a costly and time-consuming process. In this paper, we 
introduce Combined student Modeling and prerequisite Discovery 
(COMMAND), a novel algorithm for jointly inferring a prerequisite 
graph and a student model from data. Learning a COMMAND 
model requires student performance data and a mapping of items to 
skills (Q-matrix). COMMAND learns the skill prerequisite relations 
as a Bayesian network (an encoding of the probabilistic dependence 
among the skills) via a two-stage learning process. In the first stage, 
it uses an algorithm called Structural Expectation Maximization to 
select a class of equivalent Bayesian networks; in the second stage, 
it uses curriculum information to select a single Bayesian network. 
Our experiments on simulations and real student data suggest that 
COMMAND is better than prior methods in the literature. 
1. INTRODUCTION


Course curricula are usually organized in a meaningful sequence 
that evolves from relatively simple lessons to more complex ones. 
Among these lessons, some are required to be mastered by the 
student before the subsequent ones can be learned. For instance, 
students have to know how to do addition before they learn to do 
multiplication. We use the term prerequisite structure for the relationships
among skills that place strict constraints on the order in which skills
can be acquired.


Prerequisite structures are crucial for designing intelligent tutoring 
systems that assess student knowledge or that offer remediation 
interventions to students. Building such systems requires prerequisite
information that is often hand-engineered by subject matter experts
in a costly and time-consuming process. Moreover, the prerequisite
structures specified by the experts are seldom tested and might be
unreliable in the sense that experts may have "blind spots".


Recent interest in computer-assisted education promises large amounts
of data from students solving items (questions, problems, or parts
of questions). Performance data (which items a learner answers
correctly) can be used to create student models. These models represent
an estimate of skill proficiency at a given point in time [17].
For example, a student model can represent that Alice has already
mastered integer addition, but Bob has not. Student models are often
used to personalize instruction in tutoring systems or to predict future
student performance. In this paper, we introduce Combined student
Modeling and prerequisite Discovery (COMMAND), a novel
algorithm for simultaneously discovering the prerequisite structure of
skills and a student model from student performance data.


2. RELATION TO PRIOR WORK 


Prior work has investigated how to discover prerequisites among 
items without considering their mapping into skills [6, 19]. Item-to- 
skill mappings (also called Q-matrices) are desirable because they 
allow more interpretable diagnostic information. Because of this, 
follow-up work [2, 4] has studied whether a pair of skills have a 
prerequisite relationship or not. For this, we can measure if a model 
that assumes a dependency between the two skills explains the data 
better than a model that assumes independence. This comparison 
can be done with data likelihood [2] or association rule mining [4]. 
Although promising, prior methods have limitations that we address: 


1. We estimate the global prerequisite structure, not just the 
pairwise relationships. For example, suppose we want to 
discover the prerequisites of three skills for English learning 
($S_1$: syntax, $S_2$: cohesion and $S_3$: lexical rules). If we use prior
methods, we discover that the three skills are related among 
each other. However, pairwise methods are unable to tell if 
the relationships are due to indirect (e.g., $S_3 \to S_2 \to S_1$) or
direct (e.g., $S_3 \to S_2 \leftarrow S_1$) effects.



2. It is unclear how to use the output of these prerequisite struc- 
tures for student modeling. For example, it is not obvious 
how to best use them to make predictions of future student 
performance. 


3. Prior work does not provide quantitative evaluation using real 
student data. Overall, learner data has been used to provide 
examples, but without any methodology that can help compare 
what algorithm works better. 


A statistical formalism called the Bayesian network has been useful
for modeling prerequisite structures [12]. Bayesian networks allow
modeling the full structure of skills (beyond pairwise relationships)
and can encode conditional independence between the skills.
Unfortunately, prior work with Bayesian networks requires a domain
expert to design the prerequisite structures [10], and automatic
techniques have not been demonstrated with real student data [14]. We
now describe the COMMAND algorithm that discovers a Bayesian
network that encodes the prerequisite structure of skills.


Figure 1: A hypothetical Bayesian network. Solid edges are given
by the item-to-skill mapping; dashed edges between skill variables are
to be discovered from data. The conditional probability tables are to
be learned.


3. THE COMMAND ALGORITHM 
COMMAND learns the prerequisite structure of the skills from data
with a statistical model called a Bayesian network [13, 15]. Bayesian
networks are one type of probabilistic graphical model because
they can be represented visually and algebraically as a collection 
of nodes and edges. A tutorial description of Bayesian networks in 
education can be found elsewhere [12], but for now we say that they 
are often described with two components: the nodes represent the 
random variables, which we describe using conditional probability 
tables (CPTs), and the set of edges that form a directed acyclic 
graph (DAG) represent the conditional dependencies between the 
variables. Bayesian networks are a flexible tool that can be used to 
model an entire curriculum. 


Figure 1 illustrates an example of a prerequisite structure modeled
with a Bayesian network. Here, we relate four test items with the
skills of addition and multiplication. Addition is a prerequisite of
multiplication; thus there is an arrow from addition to multiplication.
Modeling prerequisites as edges in a Bayesian network allows us
to frame the discovery of the prerequisite relationships as the
well-studied machine learning problem of learning a Bayesian network
from data in the presence of unobserved latent variables. We
represent the prerequisite structure using Bayesian networks that
use latent binary variables to represent the student knowledge of a
skill (i.e., mastered or not mastered), and observed binary variables
that represent the student performance answering items (i.e., correct
or incorrect).


Algorithm 1 describes the COMMAND pipeline. The input to COMMAND
is a matrix D with n × p dimensions, representing n students
answering p items. Each entry in D encodes the performance of
a student (see Table 1 for an example). Additionally, we require
a Q-matrix to represent the item-to-skill mapping. Q-matrices are 
often designed by subject matter experts but automatic methods to 
discover them exist [8]. 


Table 1: Example student performance matrix to use with COMMAND.
The performance of a student is encoded with 1 if the
student answered the item correctly, and 0 otherwise.


User Item1 Item2 Item3 Itemp
Alice 0 1 0
Bob 1 1 ... 1
Carol 0 0 1


Algorithm 1 The COMMAND algorithm


Require: A matrix D of student performance on a set of test items,
         skill-to-item mapping Q (containing a set of skills S).

    ▷ Initialization
    G_0 ← Initialize(S, Q)
    i ← 0
    do
        E-step:
            Θ*_i ← ParametricEM(G_i, D)
            D*_i ← Inference(G_i, Θ*_i, D)
        M-step:
            (G_{i+1}, Θ_{i+1}) ← BNLearning(G_i, D*_i)
        i ← i + 1
    while stop criterion is not met

    ▷ Discriminate between equivalent BNs
    RE ← FindReversibleEdges(G_i)
    EC ← EnumEquivalentDAGs(G_i)
    DE ← {}
    for every reversible edge S_i - S_j in RE do
        ratio ← P(S_j = 0 | S_i = 0) / P(S_i = 0 | S_j = 0) ¹
        if ratio > 1 then
            ratio* ← ratio
            DE ← DE ∪ {S_i → S_j}
        else
            ratio* ← 1 / ratio
            DE ← DE ∪ {S_i ← S_j}
        end if
    end for
    sort(DE) by ratio* in descending order
    while DE is not empty do
        e ← dequeue(DE)
        if ∃ G ∈ EC such that e ∈ G then
            ∀ G ∈ EC, remove G from EC if e ∉ G
        end if
    end while
    return EC


COMMAND relies on a popular machine learning algorithm called
Structural Expectation Maximization (Structural EM), which to the
best of our knowledge has not been used in educational applications
before. Structural EM extends the Expectation Maximization
(EM) algorithm to allow efficient structure learning of Bayesian
networks when there are latent variables or missing values in the
data. A secondary contribution of our work is introducing Structural
EM for learning Bayesian network structures from educational data.
We now describe the steps of COMMAND in detail.


¹ $P(S_j = a \mid S_i = b)$ can be computed using any Bayesian network
inference algorithm, such as the junction tree algorithm [11].


Figure 2: An illustration of the Structural EM algorithm to discover the
structure of the latent variables. $G$ represents the DAG structure. $\Theta$ is the set
of conditional probability tables (CPTs).


3.1 Initial Bayesian Network
COMMAND first creates an initial Bayesian network using the
Q-matrix by creating an arc to each item from each of its required
skills. Because there are no edges between the skills, this initial
network does not encode any prerequisite information. COMMAND
uses Structural EM to learn arcs (prerequisites) between the skill
variables.
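
As a minimal sketch of this initialization step, the snippet below builds the skill-to-item arcs from a Q-matrix. The dictionary representation, the function name `initial_network`, and the item names are assumptions made for illustration; they are not part of COMMAND or LibB.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def initial_network(q_matrix: Dict[str, List[str]]) -> Tuple[Set[str], Set[Edge]]:
    """Return the skill nodes and the skill -> item arcs implied by a Q-matrix.

    q_matrix maps each item name to the list of skills the item requires.
    No skill -> skill arcs are created, so no prerequisite is encoded yet.
    """
    skills: Set[str] = set()
    edges: Set[Edge] = set()
    for item, required_skills in q_matrix.items():
        for skill in required_skills:
            skills.add(skill)
            edges.add((skill, item))  # arc from a required skill to the item
    return skills, edges


# Hypothetical Q-matrix in the spirit of Figure 1 (item names are made up).
q_matrix = {
    "item1": ["addition"],
    "item2": ["addition"],
    "item3": ["multiplication"],
    "item4": ["addition", "multiplication"],
}
skills, edges = initial_network(q_matrix)
# Arcs whose child is itself a skill; empty in the initial network.
skill_to_skill = {(u, v) for (u, v) in edges if v in skills}
print(sorted(skills), len(edges), skill_to_skill)
```

The empty `skill_to_skill` set reflects that the initial network carries no prerequisite information; those arcs are what Structural EM is later asked to add.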


3.2 Structural EM 


A common solution to learning a Bayesian network from data is
the score-and-search approach [5, 9]. This approach uses a scoring
function (such as the Bayesian Information Criterion (BIC)) to measure
the fitness of a Bayesian network structure to the observed
data, and it attempts to find the optimal model in the space of
all possible Bayesian network structures. However, conventional
score-and-search approaches rely on efficient computation
of the scoring function, which is only feasible for problems where
the data contain observations for all variables in the Bayesian network.
Unfortunately, our domain has skill variables that are not directly
observed. An intuitive work-around is to use EM to estimate the
scoring function. However, in this case EM takes a large number
(hundreds) of iterations that require Bayesian network inference,
which is computationally prohibitive. Further, we would need to run EM
for each candidate structure, and the number of possible Bayesian
network structures is super-exponential with respect to the number
of nodes. The Structural EM algorithm [7] is an efficient alternative.
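
To make the scoring function concrete, here is a small sketch of BIC for a binary Bayesian network when every variable is observed, which is the easy case described above. It uses the standard definition (maximized log-likelihood minus half the number of free parameters times log n). The function name `bic_score` and the toy data are assumptions for illustration; with latent skills this score can only be applied to completed data.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def bic_score(data: Sequence[Dict[str, int]], parents: Dict[str, List[str]]) -> float:
    """BIC of a binary Bayesian network on fully observed data.

    data: one dict per student, mapping variable name -> 0/1.
    parents: variable name -> list of parent variable names (must form a DAG).
    """
    n = len(data)
    log_lik = 0.0
    n_params = 0
    for var, pa in parents.items():
        # Counts of (parent configuration, value) and of parent configuration.
        joint = Counter((tuple(row[p] for p in pa), row[var]) for row in data)
        marg = Counter(tuple(row[p] for p in pa) for row in data)
        for (pa_val, _), count in joint.items():
            # Maximum likelihood CPT entry is count / marg[pa_val].
            log_lik += count * math.log(count / marg[pa_val])
        n_params += 2 ** len(pa)  # one free parameter per parent configuration
    return log_lik - 0.5 * math.log(n) * n_params


# Toy data in which S2 depends strongly on S1.
rows = (
    [{"S1": 0, "S2": 0}] * 45 + [{"S1": 1, "S2": 1}] * 45
    + [{"S1": 1, "S2": 0}] * 5 + [{"S1": 0, "S2": 1}] * 5
)
with_edge = bic_score(rows, {"S1": [], "S2": ["S1"]})
independent = bic_score(rows, {"S1": [], "S2": []})
print(with_edge > independent)
```

On these toy data the structure with the edge scores higher despite its extra parameter, which is the kind of comparison a score-and-search procedure makes at every step.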


Structural EM is an iterative algorithm that takes as input a matrix D of
student performance (see the example in Table 1). Figure 2 illustrates one
iteration of the Structural EM algorithm. The relevant steps are also
sketched in Algorithm 1. Each iteration consists of an Expectation
step (E-step) and a Maximization step (M-step). In the E-step, it first
finds the maximum likelihood estimate $\Theta^*$ of the CPTs for the current
structure $G$, obtained in the previous iteration, using parametric
EM. It then does Bayesian inference to compute the expected values
for the latent variables using the current model $(G, \Theta^*)$, and uses the
values to complete the data. In the M-step, it uses the conventional
score-and-search approach to optimize the structure according to the
completed data (as if the latent variables were observed). Since the
space of possible Bayesian network structures is super-exponential,
exhaustive search is intractable and local search algorithms, such
as greedy hill-climbing search, are often used. The E-step and
M-step alternate until some stop criterion is met, e.g.,
the scoring function does not change significantly. In contrast to the
conventional score-and-search algorithm, Structural EM runs EM
on only one structure in each iteration, and is thus computationally more
efficient.


We use an efficient implementation of Structural EM available online
called LibB². Because COMMAND's initialization step fixes
the arcs from skills to items according to the Q-matrix, the M-step
only needs to consider the candidate structures that comply with
the Q-matrix. An advantage of using Structural EM to discover the
prerequisite relationship of skills is that it can be easily extended
to incorporate domain knowledge. For example, we can place constraints
on the output structure to force or to disallow a skill to be a
prerequisite of another skill. Another advantage of Structural EM
is that it can be applied when there are missing data in the student
performance matrix D [7], that is, when some students do not answer
all the items. The general idea is that, in the E-step, the algorithm also
computes the expected values for missing data points, in addition
to those for latent variables.


² http://compbio.cs.huji.ac.il/LibB/programs.html


3.3 Discriminate Between Equivalent BNs 
Structural EM selects a Bayesian network model based on how well 
it explains the distribution of the data. Bayesian network theory 
states that some Bayesian networks are statistically equivalent in 
representing the data. Thus, the output from Structural EM is actually
an equivalence class (EC) that may contain many Bayesian
network structures³. These equivalent Bayesian networks have the
same skeleton and the same v-structures⁴. For instance, Figure 3
gives an example of a simple equivalence class containing three
Bayesian networks that are not distinguishable by the Structural EM
algorithm or by the method in [14]. They share the skeleton but differ
in the orientation of at least one of the edges (we will call such an
edge a reversible edge). They nevertheless represent three different
prerequisite structures.


Figure 3: Three equivalent Bayesian networks representing different 
prerequisite structures. 


3.3.1 Domain Knowledge 

To determine a unique structure, we use a heuristic based on domain
knowledge to determine the orientation of each reversible edge. For
convenience in notation, let’s assume that the random variables that
represent skill proficiency can take two values: 0 if the skill is not
mastered, and 1 if the skill is mastered. Our assumption is that if
a skill $S_1$ is the prerequisite of a skill $S_2$, a student cannot master
skill $S_2$ before she masters $S_1$. More formally:


Assumption. If $S_1$ is a prerequisite of $S_2$ (i.e., $S_1 \to S_2$), then
$S_1 = 0 \Rightarrow S_2 = 0$. In other words, $P(S_2 = 0 \mid S_1 = 0) = 1$.


Our assumption implies that $S_1$ cannot be a prerequisite of $S_2$ if
$P(S_2 = 0 \mid S_1 = 0) = 1$ does not hold. This puts a constraint on the
joint distribution encoded by the Bayesian network to be learned.


³ Structural EM outputs a DAG. However, the scoring function does
not discriminate between the many DAGs of the equivalence class.
⁴ A v-structure with nodes u, v, w in a DAG consists of the directed edges
$u \to v$ and $w \to v$, where u and w are not adjacent in the DAG [18].


For example, consider the case of choosing the orientation of a
reversible edge $S_1 - S_2$ from $S_1 \leftarrow S_2$ or $S_1 \to S_2$. We can check
whether $P(S_2 = 0 \mid S_1 = 0) = 1$ or $P(S_1 = 0 \mid S_2 = 0) = 1$. However,
it is possible that our assumption does not hold, and that a student
masters a skill without knowing its prerequisite.
Moreover, because of statistical noise, the conditional probability
$P(S_2 = 0 \mid S_1 = 0)$ may not be exactly 1. Thus, we use the following
empirical rule:


Rule 1. If $P(S_2 = 0 \mid S_1 = 0) > P(S_1 = 0 \mid S_2 = 0)$, we determine
$S_1 \to S_2$; otherwise, we determine $S_1 \leftarrow S_2$.


Note that these two conditional probabilities can be computed easily
from the Bayesian network model output from Structural EM.
The intuition behind this rule is that the conditional probability
$P(S_2 = 0 \mid S_1 = 0)$ can be interpreted as the strength of the prerequisite
relationship $S_1 \to S_2$. The larger this probability, the more
likely the relationship $S_1 \to S_2$ holds. Since here we are concerned
with which direction the edge goes, we simply compare the two
probabilities and select the direction that is more probable. Note
that $P(S_2 = 0 \mid S_1 = 0) = 1$ and $P(S_1 = 0 \mid S_2 = 0) = 1$ may hold
simultaneously. If $S_1 \to S_2$ is true, $P(S_1 = 0 \mid S_2 = 0) = 1$ only
if $P(S_1 = 1) = 0$ or if $P(S_2 = 0 \mid S_1 = 1) = 0$.⁵ If $P(S_1 = 1) = 0$,
this implies that no student knows $S_1$. If $P(S_2 = 0 \mid S_1 = 1) = 0$, it
means that learning $S_2$ becomes trivial once students know $S_1$. For
simplicity, we ignore this extreme case.
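
Rule 1 can be illustrated with a short sketch that takes a joint distribution over two binary skills and returns the chosen orientation. The function name `orient_edge`, the dictionary layout, and the probabilities are assumptions for illustration; in COMMAND the two conditional probabilities would come from inference on the learned network, and both marginals are assumed to be strictly positive here.

```python
from typing import Dict, Tuple


def orient_edge(joint: Dict[Tuple[int, int], float]) -> Tuple[str, float]:
    """Apply Rule 1 to a joint distribution over (S1, S2).

    joint[(a, b)] = P(S1 = a, S2 = b); assumes P(S1 = 0) > 0 and P(S2 = 0) > 0.
    Returns the chosen orientation and the ratio of the two conditionals.
    """
    p_s1_0 = joint[(0, 0)] + joint[(0, 1)]
    p_s2_0 = joint[(0, 0)] + joint[(1, 0)]
    p_s2_0_given_s1_0 = joint[(0, 0)] / p_s1_0
    p_s1_0_given_s2_0 = joint[(0, 0)] / p_s2_0
    ratio = p_s2_0_given_s1_0 / p_s1_0_given_s2_0
    direction = "S1 -> S2" if ratio > 1 else "S1 <- S2"
    return direction, ratio


# Made-up joint distribution: few students master S2 without S1.
joint = {(0, 0): 0.38, (0, 1): 0.02, (1, 0): 0.18, (1, 1): 0.42}
direction, ratio = orient_edge(joint)
print(direction, round(ratio, 3))
```

Here $P(S_2 = 0 \mid S_1 = 0) = 0.95$ exceeds $P(S_1 = 0 \mid S_2 = 0) \approx 0.68$, so the sketch selects $S_1 \to S_2$.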


3.3.2 Theoretical Justification of Heuristic
We now provide theoretical justification for the rule we propose.
Consider a simple equivalence class, which contains two equivalent
DAGs $S_1 \to S_2$ and $S_1 \leftarrow S_2$, where the true model is $S_1 \to S_2$. We
have three free conditional probability parameters: $P(S_1 = 0) = p$,
$P(S_2 = 0 \mid S_1 = 0) = q$, and $P(S_2 = 1 \mid S_1 = 1) = r$. Let’s define a ratio
that quantifies choosing the true model:

$$\text{ratio} = \frac{P(S_2 = 0 \mid S_1 = 0)}{P(S_1 = 0 \mid S_2 = 0)} \tag{1}$$


Using Bayes' rule and the rules of probability, the rule ratio > 1 becomes
$(1 - p)(1 - r) - p(1 - q) > 0$. Since ratio depends on p, q and r, we
study how ratio changes with these parameters. Figure 4 shows the
contour plots of log(ratio) against $P(S_1 = 0)$ and $P(S_2 = 1 \mid S_1 = 1)$
for three different values of $P(S_2 = 0 \mid S_1 = 0)$. The white region
in each contour plot is the region where our heuristic fails because
ratio < 1. Figure 4(a) shows that when $P(S_2 = 0 \mid S_1 = 0) = q = 1$,
our heuristic rule is always correct, because there
is no white region. As $P(S_2 = 0 \mid S_1 = 0)$ decreases, the white
region becomes larger and the rule becomes less accurate. As
mentioned, $P(S_2 = 0 \mid S_1 = 0)$ can be interpreted as the strength of
the prerequisite relationship. If we fix the value of $P(S_2 = 0 \mid S_1 = 0)$
and assume that the two free parameters p and r are independent and
uniformly distributed, then the area of the white region represents
the probability that the rule makes a wrong decision. As the strength
of the prerequisite relationship gets weaker, our rule to determine
the prerequisite relationship becomes less accurate.
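
The inequality above follows because, by Bayes' rule, the ratio simplifies to $P(S_2 = 0) / P(S_1 = 0) = [pq + (1 - p)(1 - r)] / p$. The sketch below uses that identity to estimate the area of the failure region on a midpoint grid over (p, r). The function names and the grid resolution are assumptions for illustration, not the procedure used to draw Figure 4.

```python
def ratio(p: float, q: float, r: float) -> float:
    """ratio = P(S2=0 | S1=0) / P(S1=0 | S2=0) for the true model S1 -> S2.

    p = P(S1 = 0), q = P(S2 = 0 | S1 = 0), r = P(S2 = 1 | S1 = 1).
    By Bayes' rule this equals P(S2 = 0) / P(S1 = 0).
    """
    p_s2_0 = p * q + (1.0 - p) * (1.0 - r)
    return p_s2_0 / p


def failure_area(q: float, steps: int = 200) -> float:
    """Fraction of the (p, r) unit square where ratio < 1 (midpoint grid)."""
    fails = 0
    for i in range(steps):
        p = (i + 0.5) / steps
        for j in range(steps):
            r = (j + 0.5) / steps
            if ratio(p, q, r) < 1.0:
                fails += 1
    return fails / (steps * steps)


for q in (1.0, 0.95, 0.9):
    print(q, failure_area(q))
```

Under the uniform assumption on p and r, the printed fractions approximate the probability of a wrong orientation for each strength q; the value for q = 1 is zero and it grows as q decreases.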


[Figure 4: three contour-plot panels, for $P(S_2 = 0 \mid S_1 = 0) = 1.0$, $0.95$, and $0.9$.]


Figure 4: Contour plots of log(ratio) against $P(S_1 = 0)$ and $P(S_2 = 1 \mid S_1 = 1)$
for various values of $P(S_2 = 0 \mid S_1 = 0)$.


⁵ Since $P(S_1 = 0 \mid S_2 = 0) = \frac{P(S_2 = 0 \mid S_1 = 0) P(S_1 = 0)}{P(S_2 = 0 \mid S_1 = 0) P(S_1 = 0) + P(S_2 = 0 \mid S_1 = 1) P(S_1 = 1)}$,
$P(S_1 = 0 \mid S_2 = 0) = 1$ only if $P(S_2 = 0 \mid S_1 = 1) P(S_1 = 1) = 0$.


3.3.3 Orient All Reversible Edges


Using our proposed rule, we can orient every reversible edge in
the network structure. However, the orientations of the reversible edges are
not independent and may conflict with each other. Orienting
one edge constrains the orientation of other reversible edges
because we have to ensure that the graph is a DAG and that the equivalence
property is not violated. For example, in Figure 5a, if we have
determined $S_1 \to S_2$, the edge $S_2 \to S_3$ is enforced. In this paper, we
take an ad-hoc strategy to determine the orientation of all reversible
edges. For each reversible edge $S_i - S_j$, we let ratio* = ratio if
ratio > 1 and ratio* = 1/ratio otherwise. The larger ratio* is, the
more confident we are in the orientation. We sort the list of
reversible edges by ratio* in descending order. We then orient the
edges in this order. In our implementation, we use the following
strategy: we first enumerate all equivalent Bayesian networks and
make them a list of candidates; when an edge is oriented as $S_i \to S_j$,
we remove all contradicting Bayesian networks from the list. Eventually
only one Bayesian network structure remains. This procedure is
detailed in the Discriminate between equivalent BNs section of
Algorithm 1. EnumEquivalentDAGs($G_i$) implements the algorithm
for enumerating equivalent DAGs in [3].
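
The pruning strategy just described can be sketched as follows, assuming the equivalent DAGs have already been enumerated and the ratio of every reversible edge has been computed. The function name `select_dag`, the edge-set representation, and the example ratios are assumptions for illustration; enumerating the equivalence class itself (the algorithm in [3]) is not shown.

```python
from typing import Dict, FrozenSet, List, Tuple

Edge = Tuple[str, str]   # (parent, child)
DAG = FrozenSet[Edge]    # skill-to-skill edges only


def select_dag(candidates: List[DAG], ratios: Dict[Edge, float]) -> List[DAG]:
    """Prune an equivalence class using the ratio-based ordering.

    ratios maps one (arbitrary) orientation (a, b) of each reversible edge to
    ratio = P(b = 0 | a = 0) / P(a = 0 | b = 0); ratios are assumed positive.
    """
    decisions = []
    for (a, b), ratio in ratios.items():
        if ratio > 1:
            decisions.append((ratio, (a, b)))        # prefer a -> b
        else:
            decisions.append((1.0 / ratio, (b, a)))  # prefer b -> a
    # Most confident decisions (largest ratio*) are applied first.
    decisions.sort(key=lambda d: d[0], reverse=True)
    remaining = list(candidates)
    for _, edge in decisions:
        # Skip a decision that no remaining candidate can satisfy.
        if any(edge in g for g in remaining):
            remaining = [g for g in remaining if edge in g]
    return remaining


# The three DAGs equivalent to the chain S1 - S2 - S3 (no v-structure).
equivalence_class = [
    frozenset({("S1", "S2"), ("S2", "S3")}),
    frozenset({("S2", "S1"), ("S3", "S2")}),
    frozenset({("S2", "S1"), ("S2", "S3")}),
]
chosen = select_dag(equivalence_class, {("S1", "S2"): 1.5, ("S2", "S3"): 0.8})
print(chosen)
```

In this example the more confident decision $S_1 \to S_2$ is applied first and leaves a single candidate; the weaker decision $S_3 \to S_2$ would create a v-structure, matches no remaining candidate, and is therefore skipped.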


4. EVALUATION 

In § 4.1, we evaluate COMMAND with simulated data to assess the 
quality of the discovered prerequisite structures. Then, in § 4.2 we 
use data collected from real students. In all our experiments, we use
BIC as the scoring function in Structural EM.


4.1 Simulated Data 

Synthetic data allow us to study how COMMAND compares to the 
ground truth. For this, we engineered three prerequisite structures 
(DAGs), shown in Figure 5. Here, each figure represents different 
causal relations between the simulated latent skill variables. 


[Figure 5 panels: (a) Structure 1, (b) Structure 2, (c) Structure 3]


Figure 5: Three different DAGs between latent skill variables. Item
nodes are omitted.


For clarity, Figure 5 omits the item nodes, but each skill node is the
parent of six item variables and each item variable has 1-3 skill nodes
as parents. All of these nodes are modeled using binary random 
variables. More precisely, the latent nodes represent whether the 
student achieves mastery of the skill, and the observed nodes indicate 
if the student answers the item correctly. Notice that these networks 
include the prerequisite structures as well as the skill-item mapping. 


We consider simulated data with different numbers of observations
(n = 150, 500, 1000, 2000). For each sample size and each DAG, we
generate ten different sets of conditional probability tables randomly
with three constraints. First, we enforce that achieving mastery of the
prerequisites of a skill will increase the likelihood of mastering the
skill. Second, for each prerequisite pair $S_i \to S_j$, $P(S_j = 0 \mid S_i = 0)$
is randomly selected to be in [0.9, 1.0]. Finally, mastery of a skill
increases the probability of a student correctly answering the test item.
In total we generated 120 synthetic datasets (3 DAGs × 4 sample
sizes × 10 CPTs), and report the average results.
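
A minimal sketch of this kind of generator, for a two-skill network $S_1 \to S_2$ with single-skill items, is shown below. Only the [0.9, 1.0] range comes from the text; the other sampling ranges, the function names, and the two-skill structure are assumptions for illustration and do not reproduce the structures in Figure 5.

```python
import random
from typing import Dict, List, Tuple


def sample_cpts(rng: random.Random) -> Dict[str, float]:
    """Draw CPT entries for S1 -> S2 that satisfy the three constraints."""
    p_s2_0_given_s1_0 = rng.uniform(0.9, 1.0)  # constraint 2 (range from the text)
    return {
        "p_s1_1": rng.uniform(0.3, 0.7),  # hypothetical range
        "p_s2_0_given_s1_0": p_s2_0_given_s1_0,
        # Constraint 1: P(S2=1 | S1=1) exceeds P(S2=1 | S1=0) by at least 0.1.
        "p_s2_1_given_s1_1": rng.uniform(1.0 - p_s2_0_given_s1_0 + 0.1, 1.0),
        # Constraint 3: mastery raises the chance of a correct answer.
        "p_correct_mastered": rng.uniform(0.7, 0.95),    # hypothetical range
        "p_correct_unmastered": rng.uniform(0.05, 0.3),  # hypothetical range
    }


def simulate(n: int, items_per_skill: int, seed: int = 0) -> Tuple[Dict[str, float], List[List[int]]]:
    """Sample n students; each row holds the item responses for S1, then S2."""
    rng = random.Random(seed)
    cpts = sample_cpts(rng)
    data = []
    for _ in range(n):
        s1 = int(rng.random() < cpts["p_s1_1"])
        p_s2_1 = cpts["p_s2_1_given_s1_1"] if s1 else 1.0 - cpts["p_s2_0_given_s1_0"]
        s2 = int(rng.random() < p_s2_1)
        row: List[int] = []
        for mastered in (s1, s2):
            p = cpts["p_correct_mastered"] if mastered else cpts["p_correct_unmastered"]
            row.extend(int(rng.random() < p) for _ in range(items_per_skill))
        data.append(row)
    return cpts, data


cpts, data = simulate(n=150, items_per_skill=6, seed=1)
print(len(data), len(data[0]))
```

Only the item responses in `data` would be passed to a structure learner; the sampled skill states stay hidden, mirroring the latent-variable setting.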


We evaluate how well COMMAND can discover the true prerequisite
structure using metrics designed to evaluate Bayesian network
structure discovery. In particular, we use the $F_1$ adjacency score and
the $F_1$ orientation score. The adjacency score measures how well
we can recover connections between nodes. It is the harmonic mean
of the true positive adjacency rate and the true discovery adjacency
rate. On the other hand, the orientation score measures how well we
can recover the direction of the edges. It is calculated as the harmonic
mean of the true positive orientation rate and the true discovery
orientation rate. In both cases, the $F_1$ score reaches its best value at
1 and worst at 0. Moreover, for comparison, we compute the $F_1$
adjacency score for Bayesian network structures whose skill nodes
are fully connected with each other. These fully connected DAGs
serve as baselines for evaluating the adjacency discovery.⁶ For
completeness, we list the formulas for the adjacency and orientation
scores in Tables 2 and 3, respectively.


Table 2: Formulas for measuring adjacency rate (AR)


Metric                  Formula
True positive (TPAR)    (# of correct adjacencies in learned model) / (# of adjacencies in true model)
True discovery (TDAR)   (# of correct adjacencies in learned model) / (# of adjacencies in learned model)
$F_1$-AR                (2 · TPAR · TDAR) / (TPAR + TDAR)


Table 3: Formulas for measuring orientation rate (OR)


Metric                  Formula
True positive (TPOR)    (# of correctly directed edges in learned model) / (# of directed edges in true model)
True discovery (TDOR)   (# of correctly directed edges in learned model) / (# of directed edges in learned model)
$F_1$-OR                (2 · TPOR · TDOR) / (TPOR + TDOR)
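
The formulas in Tables 2 and 3 can be computed with a few lines of code. The sketch below assumes graphs are given as sets of directed (parent, child) pairs over skill nodes; the function names and the example graphs are hypothetical.

```python
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def f1(true_positive_rate: float, true_discovery_rate: float) -> float:
    """Harmonic mean of the two rates; 0 when both are 0."""
    total = true_positive_rate + true_discovery_rate
    return 0.0 if total == 0 else 2 * true_positive_rate * true_discovery_rate / total


def _rates(true_items: Set, learned_items: Set) -> Tuple[float, float]:
    """Return (true positive rate, true discovery rate) for two sets."""
    if not true_items or not learned_items:
        return 0.0, 0.0
    correct = len(true_items & learned_items)
    return correct / len(true_items), correct / len(learned_items)


def adjacency_f1(true_edges: Iterable[Edge], learned_edges: Iterable[Edge]) -> float:
    """F1-AR: edges are compared ignoring direction."""
    true_adj = {frozenset(e) for e in true_edges}
    learned_adj = {frozenset(e) for e in learned_edges}
    return f1(*_rates(true_adj, learned_adj))


def orientation_f1(true_edges: Iterable[Edge], learned_edges: Iterable[Edge]) -> float:
    """F1-OR: an edge counts only if its direction matches the true model."""
    return f1(*_rates(set(true_edges), set(learned_edges)))


true_dag = {("S1", "S2"), ("S2", "S3")}
learned_dag = {("S1", "S2"), ("S3", "S2"), ("S1", "S3")}
print(adjacency_f1(true_dag, learned_dag), orientation_f1(true_dag, learned_dag))
```

In this made-up example the learned graph recovers both true adjacencies but adds a spurious one and reverses one edge, so the adjacency score is higher than the orientation score.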


We use these metrics to evaluate the effect of varying the number 
of observations of the training set (sample size) on the quality of 
learning the prerequisite structure. We designed experiments to 
specifically answer the following four questions: 


1. How does the type of items affect COMMAND’s ability to 
recover the prerequisite structure? We consider the situation 
where in the model each item requires only one skill and the 
situation where each item requires multiple skills. 

2. How well does COMMAND perform when there is noise in 
the data? We focus on studying noise due to the presence of 
unaccounted latent variables. 

3. How well does COMMAND perform when the student performance
data have missing values?

4. How does COMMAND compare with other prerequisite discovery
methods? In particular, we compare COMMAND to
the Probabilistic Association Rules Mining (PARM) method 
[4]. 


We now investigate these questions. 


⁶ We do not compute the $F_1$ orientation score for fully connected DAGs
because all edges in a fully connected DAG are reversible.


4.1.1 Single-skill vs Multi-skill Items 

We consider two situations where different types of Q-matrix are
used. In the first situation, each item node maps to exactly one skill
node. In the second one, each item maps to 1-3 skills. Figure 6
compares the $F_1$ scores for adjacency discovery and edge orientation
under the two types of Q-matrices. With only 500 observations,
COMMAND improves on a fully connected Bayesian network baseline.
COMMAND’s accuracy improves with the amount of data, but
its accuracy is slightly lower when the Q-matrix contains items that
require more than one skill. A possible explanation for this is that
multi-skill items may introduce more spurious correlations in the
data. With just 2000 observations, COMMAND recovers the true
structures almost perfectly.


Figure 6: Comparison of $F_1$ scores for adjacency discovery (top
row) and for edge orientation (bottom row). Horizontal lines are
baseline scores for fully-connected (complete) networks. The error
bars show the 95% confidence intervals, i.e., ±1.96 × SE.
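
For reference, an interval of the form mean ± 1.96 × SE can be computed as in the sketch below. It assumes that SE is the standard error of the mean over repeated runs (sample standard deviation divided by the square root of the number of runs), which the caption does not state explicitly; the scores are made up.

```python
import math
import statistics
from typing import Sequence, Tuple


def ci95(scores: Sequence[float]) -> Tuple[float, float]:
    """Normal-approximation 95% confidence interval for the mean score."""
    mean = statistics.mean(scores)
    # Standard error of the mean (an assumption about how SE is defined).
    se = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - 1.96 * se, mean + 1.96 * se


# Made-up F1 scores from ten hypothetical runs.
scores = [0.80, 0.90, 1.00, 0.70, 0.60, 0.85, 0.95, 0.75, 0.80, 0.65]
low, high = ci95(scores)
print(low, high)
```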


4.1.2 Sensitivity to Noise 

Real-world data sets often contain various types of noise. For example,
noise may occur due to latent variables that are not explicitly
modeled. To evaluate the sensitivity of COMMAND to noise, we
extend the three Bayesian networks in Figure 5 to include a
StudentAbility node that takes three possible states (low/med/high).
In these Bayesian networks, students’ performance depends not only
on whether they have mastered the skills, but also on their individual
ability. For simplicity, all items in this setting are single-skilled
items. We first simulate data from Bayesian networks that have a
StudentAbility variable to generate “noisy” data samples, and then
use these data to recover the prerequisite structure. Figure 7 illustrates
the procedure of this sensitivity analysis experiment for Structure 1.


Figure 7: Evaluation of COMMAND with noisy data


Figure 8: Results of adding systematic noise. Top: Comparison of
$F_1$ scores for adjacency discovery. Horizontal lines are baseline $F_1$
scores computed for fully connected Bayesian networks. Bottom:
Comparison of $F_1$ scores for edge orientation.


Figure 8 compares the results with and without the introduced noise.
Interestingly, the noise actually improves COMMAND’s accuracy.
This improvement is more evident when the sample size is small
(see n = 150). For smaller sample sizes, Structural EM usually
discovers fewer relationships than actually exist, because BIC prefers
sparse structures. We hypothesize that the correlations caused by the
StudentAbility node cause Structural EM to add “stronger”
edges between skill nodes, resulting in a higher $F_1$.


4.1.3 Sensitivity to Missing Values 

Real-world datasets collected from students often have missing
values, for example, when learners do not answer all items. To
evaluate how COMMAND performs on data with missing values,
we generated data sets with 1000 observations and a varying
fraction of randomly missing values (10%, 20%, 30%, 40%, 50%).
We used COMMAND to recover the structures from these data sets.
Again, the models only contain single-skilled items. Figure 9 shows
the results of this experiment. Although accuracy decreases when
the fraction of missing values increases, COMMAND is able to
recover the true structures for Structures 1 and 2 even when the data
contain up to 30% missing values.
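
A simple way to produce such data sets is to blank out a fixed fraction of cells chosen uniformly at random, as in the sketch below. The function name `mask_at_random`, the use of `None` as the missing-value marker, and the toy matrix are assumptions for illustration; the source does not specify how the missing entries were drawn beyond being random.

```python
import random
from typing import List, Optional


def mask_at_random(data: List[List[int]], fraction: float, seed: int = 0) -> List[List[Optional[int]]]:
    """Return a copy of data with the given fraction of cells set to None."""
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(data) for j in range(len(row))]
    n_missing = round(fraction * len(cells))
    masked: List[List[Optional[int]]] = [list(row) for row in data]
    # Sample cells without replacement so exactly n_missing entries are removed.
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


# Toy 10 x 10 performance matrix of all-correct answers.
complete = [[1] * 10 for _ in range(10)]
with_gaps = mask_at_random(complete, fraction=0.3, seed=42)
n_gaps = sum(value is None for row in with_gaps for value in row)
print(n_gaps)
```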


Figure 9: Results of learning with missing data. Left: Comparison
of $F_1$ scores for adjacency discovery. Horizontal lines are baseline
$F_1$ scores computed for fully connected Bayesian networks. Right:
Comparison of $F_1$ scores for edge orientation.


4.1.4 Comparison With Prior Work

Probabilistic Association Rules Mining (PARM) is a recent
algorithm for discovering the prerequisite relationships between
skills [4]. In this approach, a prerequisite relationship $S_1 \to S_2$ is
considered to exist if
$P\big(P(S_1 = 1, S_2 = 1) > \mathit{minsup} \wedge P(S_1 = 1 \mid S_2 = 1) > \mathit{minconf}\big) > \mathit{minprob}$
and
$P\big(P(S_1 = 0, S_2 = 0) > \mathit{minsup} \wedge P(S_2 = 0 \mid S_1 = 0) > \mathit{minconf}\big) > \mathit{minprob}$,
where minsup, minconf and minprob are pre-specified constants between 0 and 1.


We simulate data from Structure 3 from Figure 5(c) (with single- 
skilled items), which has 21 pair-wise prerequisite relationships. We 
derive pair-wise prerequisite relationships from this network and 
see how the two approaches discover these relationships. When ex- 
perimenting with PARM, we use minsup = 0.125, minconf = 0.76, 
minprob = 0.9, because they were suggested by the authors [4]. 


PARM is limited to discovering pair-wise prerequisite relationships
(instead of constructing the full structure). To make a fair comparison,
we evaluate how accurately COMMAND and PARM discover
relationship pairs. For this, we use the $F_1$ metric in Table 2, but
we count pairs of related skills instead of adjacencies. Two skills
are related if one is a descendant of the other. Figure 10 shows
that COMMAND outperforms PARM, and the difference becomes
significant for sample size n > 500. PARM's low $F_1$ score arises
because it fails to discover many prerequisite relationships (data
not shown), and because PARM does not respect transitivity. For
example, PARM may reject $S_1 \to S_3$ even if it has discovered $S_1 \to S_2$
and $S_2 \to S_3$. We speculate that selecting a different set of cutoff
values for PARM may improve the results. However, determining
these thresholds is not trivial and may require experts’ intervention.
By contrast, COMMAND does not require tuning.
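
The pair-based variant of the metric can be sketched as follows: related pairs are derived from a DAG through its descendant relation and then compared with a set of discovered pairs. The function names, the treatment of pairs as unordered, and the example are assumptions for illustration.

```python
from typing import Dict, FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]  # (parent, child)


def related_pairs(edges: Iterable[Edge]) -> Set[FrozenSet[str]]:
    """Unordered pairs {u, v} such that v is a descendant of u in the DAG."""
    children: Dict[str, Set[str]] = {}
    for parent, child in edges:
        children.setdefault(parent, set()).add(child)
    pairs: Set[FrozenSet[str]] = set()
    for start in children:
        stack = list(children[start])
        seen: Set[str] = set()
        while stack:  # depth-first walk over the descendants of start
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            pairs.add(frozenset((start, node)))
            stack.extend(children.get(node, ()))
    return pairs


def pair_f1(true_edges: Iterable[Edge], found_pairs: Iterable[Tuple[str, str]]) -> float:
    """F1 over related pairs, following the form of the formulas in Table 2."""
    true_pairs = related_pairs(true_edges)
    found = {frozenset(p) for p in found_pairs}
    if not true_pairs or not found:
        return 0.0
    correct = len(true_pairs & found)
    tp, td = correct / len(true_pairs), correct / len(found)
    return 0.0 if tp + td == 0 else 2 * tp * td / (tp + td)


chain = [("S1", "S2"), ("S2", "S3")]
# A pairwise method that misses the transitive pair (S1, S3).
score = pair_f1(chain, [("S1", "S2"), ("S2", "S3")])
print(len(related_pairs(chain)), score)
```

The chain has three related pairs because $S_3$ is a descendant of $S_1$; an output that omits that transitive pair is penalized even though every pair it reports is correct.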


Figure 10: Comparison of COMMAND and PARM for discovering
prerequisite relationships in Structure 3.


4.2 Real Student Performance Data 
We now evaluate COMMAND using two real-world data sets. 


4.2.1 English Data Set 

The Examination for the Certification of Proficiency in English
(ECPE) dataset describes the understanding of English language
grammar of 2922 examinees [16]. The dataset includes student
performance on 28 items on 3 skills ($S_1$: morphosyntactic rules, $S_2$:
cohesive rules, and $S_3$: lexical rules). Each item requires either one
or two of the three skills.


Figure 11 shows the prerequisite structure discovered with COMMAND.
It hypothesizes that lexical rules is a prerequisite of cohesive
rules and morphosyntactic rules, and that cohesive rules is a necessary
skill for learning morphosyntactic rules. The pair-wise prerequisite
relationships fully agree with the findings in [16] and with those of the
PARM method in [4]. Our model infers a complete DAG, suggesting
that there are no conditional independencies among the three
skills. This is an insight that previous approaches cannot
provide. Further, COMMAND also outputs the conditional probabilities
associated with each skill and its direct prerequisites. We
see that the probability of a student mastering a skill increases
when the student has acquired more prerequisites of the skill.


Figure 11: The estimated DAG and CPTs of the ECPE data set.


4.2.2 Math Data Set 

We now evaluate COMMAND using data collected from a commercial 
non-adaptive tutoring system. The textbook items are classified 
into chapters, sections, and objectives. We only use student performance 
data from the tests in Chapters 2 and 3. That is, students are tested 
on the items after they have been taught all relevant skills. 


Q-matrix and preprocessing. We define skills as book sections. 
We use a Q-matrix that assigns each exercise to a single skill, 
namely the book section in which the item appears.⁷ For each chapter, 
we process the data to find a subset of items and students that do not 
have missing values. That is, in the datasets we use with COMMAND, 
every student has responded to all of the items. 
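
The preprocessing described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not say how the complete subset of items and students was chosen, so the coverage-threshold heuristic, the parameter `min_item_coverage`, the function names, and the toy data are all hypothetical.

```python
def build_q_matrix(item_sections):
    """One row per item, one column per skill; each item maps to its book section."""
    skills = sorted(set(item_sections))
    column = {skill: j for j, skill in enumerate(skills)}
    q = [[0] * len(skills) for _ in item_sections]
    for row, section in enumerate(item_sections):
        q[row][column[section]] = 1
    return q, skills

def complete_cases(responses, min_item_coverage=0.5):
    """Assumed heuristic: drop sparsely answered items, then keep only the
    students who answered every remaining item. None marks a missing value."""
    n_students = len(responses)
    n_items = len(responses[0])
    keep_items = [
        j for j in range(n_items)
        if sum(r[j] is not None for r in responses) / n_students >= min_item_coverage
    ]
    keep_students = [
        s for s, r in enumerate(responses)
        if all(r[j] is not None for j in keep_items)
    ]
    filtered = [[responses[s][j] for j in keep_items] for s in keep_students]
    return filtered, keep_students, keep_items

# Toy data (hypothetical): four items from two sections, four students.
item_sections = ["2_2", "2_2", "2_3", "2_3"]
q_matrix, skills = build_q_matrix(item_sections)
responses = [
    [1, 0, 1, None],
    [1, 1, 0, None],
    [0, None, 1, None],
    [1, 1, 1, None],
]
filtered, kept_students, kept_items = complete_cases(responses)
```

Other selection strategies would also yield a response matrix without missing values. The only property the sketch relies on is that every retained student has answered every retained item.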


After filtering, two data sets, Math-chap2 and Math-chap3, were 
obtained for Chapters 2 and 3, respectively. Math-chap2 includes six 
skills, each tested on three to eight items, 
for a total of 30 items. Math-chap3 includes seven skills, 
each tested on three to seven items, for a total of 33 items. 
Math-chap2 contains test results for 1720 students, while 
Math-chap3 contains test results for 1245 students. For simplicity, we 
use binary variables to encode performance data and skill variables. 


Prerequisite Structure Discovery. The Bayesian networks 
generated with the COMMAND algorithm are illustrated in Figure 12. 
We observe that the topological order of the sections 
in both structures is fully consistent with the book ordering heuristic. 
This shows agreement between our fully data-driven method 
and human experts. We also ran the PARM approach to learn pairwise 
prerequisite relationships from these data sets. Given minsup = 
0.125, minconf = 0.76 and minprob = 0.9, PARM discovers 2_5 → 2_6, 
2_5 → 2_7 and 2_6 → 2_7 for Math-chap2, and 3_1 → 3_3 and 
3_2 → 3_3 for Math-chap3. These relationships are a 
small subset of the set of relationships discovered by COMMAND. 
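
The comparison against the book ordering can be sketched as a simple check that every prerequisite edge points from an earlier section to a later one. This is a minimal illustration, not the authors' code: the function name `order_violations` is hypothetical, and the edges are the PARM pairs reported in the text for Math-chap2, since the COMMAND edges appear only in Figure 12.

```python
def order_violations(edges, book_order):
    """Return the edges (prerequisite, dependent) whose prerequisite does not
    come strictly before the dependent section in the book."""
    position = {section: i for i, section in enumerate(book_order)}
    return [(u, v) for u, v in edges if position[u] >= position[v]]

# Sections of Chapter 2 in book order (skill IDs as used in the text).
book_order = ["2_2", "2_3", "2_4", "2_5", "2_6", "2_7"]

# Pairwise relationships reported for PARM on Math-chap2.
parm_edges = [("2_5", "2_6"), ("2_5", "2_7"), ("2_6", "2_7")]
parm_violations = order_violations(parm_edges, book_order)

# A hypothetical edge that contradicts the book order, for contrast.
reversed_violations = order_violations([("2_7", "2_5")], book_order)
```

An empty result means the book order is one valid topological order of the learned graph. Any returned edge identifies a relationship that contradicts the curriculum sequence.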


⁷ Here we assume the items are single-skilled, even though they might 
be multi-skilled. 


Skill ID   Skill Name 
2_2        Symbols and Sets of Numbers 
2_3        Fractions and Mixed Numbers 
2_4        Exponents, Order of 
2_5        Adding Real Numbers 
2_6        Subtracting Real Numbers 
2_7        Multiplying and Dividing Real Numbers 
3_1        Simplifying Algebraic Expressions 
3_2        The Addition and Multiplication Properties of Equality 
3_3        Solving Linear Equations 
3_4        An Introduction to Problem Solving 
3_5        Formulas and Problem Solving 
3_6        Percent and Mixture Problem Solving 
3_8        Solving Linear Inequalities 


(b) Prerequisite structure learned for Math-chap3. 


Figure 12: Prerequisite structures constructed by COMMAND for 
Math data sets. 


Predictive Performance. COMMAND outputs a Bayesian network 
model that can be used for inference and predictive modeling. 
For example, given a student’s responses to a set of items, we can 
infer the student’s knowledge status of a skill. We could use COMMAND 
to identify students that may need remediation because they 
lack some background. We evaluate the accuracy of the predicted 
student performance on an item when we observe the student’s 
responses on the other items. More precisely, we compute the posterior 
probability of a student’s response to an item $I_i$ given the student’s 
performance on all other items $I_{-i} = I \setminus \{I_i\}$, by marginalizing over the 
set of latent variables $S$: 


$$P(I_i \mid I_{-i}) \propto \sum_{S} P(I_i, S, I_{-i}).$$
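
For intuition, the marginalization above can be carried out by brute-force enumeration on a very small network. The sketch below is a toy illustration only: the two-skill structure S1 → S2, the item-to-skill mapping, and all probabilities are hypothetical values, not parameters learned by COMMAND.

```python
from itertools import product

# Hypothetical toy model: two binary skills with S1 -> S2, and three items.
P_S1 = 0.6                          # P(S1 = 1)
P_S2_GIVEN_S1 = {0: 0.1, 1: 0.7}    # P(S2 = 1 | S1)
ITEM_SKILL = [0, 0, 1]              # index of the single skill behind each item
P_CORRECT = {0: 0.2, 1: 0.9}        # P(correct | skill state)

def joint(responses, skills):
    """Joint probability P(I, S) of a full response vector and a skill state."""
    s1, s2 = skills
    p = P_S1 if s1 else 1 - P_S1
    p *= P_S2_GIVEN_S1[s1] if s2 else 1 - P_S2_GIVEN_S1[s1]
    for item, answer in enumerate(responses):
        p_correct = P_CORRECT[skills[ITEM_SKILL[item]]]
        p *= p_correct if answer else 1 - p_correct
    return p

def predict(responses, target):
    """P(I_target = 1 | all other responses), summing the joint over all skill
    states and normalizing over the two possible values of the target item."""
    weights = {}
    for value in (0, 1):
        filled = list(responses)
        filled[target] = value      # the observed value of the target is ignored
        weights[value] = sum(joint(filled, s) for s in product((0, 1), repeat=2))
    return weights[1] / (weights[0] + weights[1])

# Posterior for item 2 after two correct vs. two incorrect answers on S1 items.
p_after_correct = predict([1, 1, None], target=2)
p_after_wrong = predict([0, 0, None], target=2)
```

Enumeration costs time exponential in the number of skills, so it is practical only for toy models like this one.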


This probability can be computed efficiently using the junction 
tree algorithm [11]. We then perform binary classification based on the 
posterior probability to determine whether the student is likely to answer 
correctly. We compare the Bayesian network models generated by 
COMMAND with five baseline predictors: 


• A majority classifier, which always assigns an instance to 
the majority class. For example, if the majority of the students 
get an item wrong, other students would likely get it wrong. 


• A Bayesian network model in which the skill variables are 
disconnected. This model assumes that the skill variables are 
marginally independent of each other. Most existing knowledge 
tracing approaches make this assumption. 


• A Bayesian network model in which the skill variables are 
connected in a chain structure, i.e., 2_2 → 2_3 → 2_4 → ... This 
assumes that a section (skill) depends only on the previous 
section, that is, a first-order Markov chain dependency 
structure. 


• A Bayesian network model constructed using the pairwise 
relationships output by PARM. That is, we create an edge 
S_i → S_j if PARM says S_i is a prerequisite of S_j. 


• A fully connected Bayesian network, in which the skill variables 
are fully connected with each other. This model assumes 
no conditional independence between skill variables and can 
encode any joint distribution over the skill variables. However, 
it has an exponential number of free parameters and thus can 
easily overfit the data. 


work is that we develop a methodology to evaluate prerequisite 
structures on real student data. We believe that we are the first 
to compare prerequisite discovery strategies by how well they can 
be used to predict student performance. Therefore, we validate 
COMMAND not only with synthetic data, but also with two real-world 
datasets. Our results suggest that COMMAND improves on the state 
of the art because it significantly improves on a recently published 
technique. 


Learning a prerequisite graph is not merely discovering a Bayesian 
network: equivalent Bayesian network structures in fact represent 
different prerequisite structures. We believe we are the first to 
address this problem. We use domain knowledge to refine the 
prerequisite models that are output, using a theoretically motivated method. 